scheduling decision
Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling
Jadhav, Prachi, Jin, Hongwei, Deelman, Ewa, Balaprakash, Prasanna
High-Performance Computing (HPC) job scheduling involves balancing conflicting objectives such as minimizing makespan, reducing wait times, optimizing resource use, and ensuring fairness. Traditional methods, including heuristic-based, e.g., First-Come-First-Served (FJFS) and Shortest Job First (SJF), or intensive optimization techniques, often lack adaptability to dynamic workloads and, more importantly, cannot simultaneously optimize multiple objectives in HPC systems. To address this, we propose a novel Large Language Model (LLM)-based scheduler using a ReAct-style framework (Reason + Act), enabling iterative, interpretable decision-making. The system incorporates a scratchpad memory to track scheduling history and refine decisions via natural language feedback, while a constraint enforcement module ensures feasibility and safety. We evaluate our approach using OpenAI's O4-Mini and Anthropic's Claude 3.7 across seven real-world HPC workload scenarios, including heterogeneous mixes, bursty patterns, and adversarial cases etc. Comparisons against FCFS, SJF, and Google OR-Tools (on 10 to 100 jobs) reveal that LLM-based scheduling effectively balances multiple objectives while offering transparent reasoning through natural language traces. The method excels in constraint satisfaction and adapts to diverse workloads without domain-specific training. However, a trade-off between reasoning quality and computational overhead challenges real-time deployment. This work presents the first comprehensive study of reasoning-capable LLMs for HPC scheduling, demonstrating their potential to handle multiobjective optimization while highlighting limitations in computational efficiency. The findings provide insights into leveraging advanced language models for complex scheduling problems in dynamic HPC environments.
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.40)
- North America > United States > Tennessee > Knox County > Knoxville (0.40)
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- (3 more...)
- Energy (1.00)
- Government > Regional Government > North America Government > United States Government (0.64)
HRS: Hybrid Representation Framework with Scheduling Awareness for Time Series Forecasting in Crowdsourced Cloud-Edge Platforms
Zhang, Tiancheng, Zhang, Cheng, Liu, Shuren, Wang, Xiaofei, Huang, Shaoyuan, Wang, Wenyu
With the rapid proliferation of streaming services, network load exhibits highly time-varying and bursty behavior, posing serious challenges for maintaining Quality of Service (QoS) in Crowdsourced Cloud-Edge Platforms (CCPs). While CCPs leverage Predict-then-Schedule architecture to improve QoS and profitability, accurate load forecasting remains challenging under traffic surges. Existing methods either minimize mean absolute error, resulting in underprovisioning and potential Service Level Agreement (SLA) violations during peak periods, or adopt conservative overprovision-ing strategies, which mitigate SLA risks at the expense of increased resource expenditure. To address this dilemma, we propose HRS, a H ybrid R epresentation framework with S cheduling awareness that integrates numerical and image-based representations to better capture extreme load dynamics. We further introduce a Scheduling-A ware Loss (SAL) that captures the asymmetric impact of prediction errors, guiding predictions that better support scheduling decisions. Extensive experiments on four real-world datasets demonstrate that HRS consistently outperforms ten baselines and achieves state-of-the-art performance, reducing SLA violation rates by 63.1% and total profit loss by 32.3%. Our code is available at [28].
InterQ: A DQN Framework for Optimal Intermittent Control
Aggarwal, Shubham, Maity, Dipankar, Başar, Tamer
In this letter, we explore the communication-control co-design of discrete-time stochastic linear systems through reinforcement learning. Specifically, we examine a closed-loop system involving two sequential decision-makers: a scheduler and a controller. The scheduler continuously monitors the system's state but transmits it to the controller intermittently to balance the communication cost and control performance. The controller, in turn, determines the control input based on the intermittently received information. Given the partially nested information structure, we show that the optimal control policy follows a certainty-equivalence form. Subsequently, we analyze the qualitative behavior of the scheduling policy. To develop the optimal scheduling policy, we propose InterQ, a deep reinforcement learning algorithm which uses a deep neural network to approximate the Q-function. Through extensive numerical evaluations, we analyze the scheduling landscape and further compare our approach against two baseline strategies: (a) a multi-period periodic scheduling policy, and (b) an event-triggered policy. The results demonstrate that our proposed method outperforms both baselines. The open source implementation can be found at https://github.com/AC-sh/InterQ.
- North America > United States > Illinois > Champaign County > Urbana (0.14)
- North America > United States > North Carolina > Mecklenburg County > Charlotte (0.04)
Dynamic Operating System Scheduling Using Double DQN: A Reinforcement Learning Approach to Task Optimization
Sun, Xiaoxuan, Duan, Yifei, Deng, Yingnan, Guo, Fan, Cai, Guohui, Peng, Yuting
- In this paper, an operating system scheduling algorithm based on Double DQN (Double Deep Q network) is proposed, and its performance under different task types and system loads is verified by experiments. Compared with the traditional scheduling algorithm, the algorithm based on Double DQN can dynamically adjust the task priority and resource allocation strategy, thus improving the task completion efficiency, system throughput, and response speed. The experimental results show that the Double DQN algorithm has high scheduling performance under light load, medium load and heavy load scenarios, especially when dealing with I/O intensive tasks, and can effectively reduce task completion time and system response time. In addition, the algorithm also shows high optimization ability in resource utilization and can intelligently adjust resource allocation according to the system state, avoiding resource waste and excessive load. Future studies will further explore the application of the algorithm in more complex systems, especially scheduling optimization in cloud computing and large - scale distributed environments, combining factors such as network latency and energy efficiency to improve the overall performance and adaptability of the algorithm. In modern computing environments, the operating system serves as the critical intermediary between computer hardware and applications. The efficiency of its scheduling algorithms plays a crucial role in system performance.
Deep Reinforcement Learning-Based User Scheduling for Collaborative Perception
Liu, Yandi, Liu, Guowei, Liang, Le, Ye, Hao, Guo, Chongtao, Jin, Shi
Stand-alone perception systems in autonomous driving suffer from limited sensing ranges and occlusions at extended distances, potentially resulting in catastrophic outcomes. To address this issue, collaborative perception is envisioned to improve perceptual accuracy by using vehicle-to-everything (V2X) communication to enable collaboration among connected and autonomous vehicles and roadside units. However, due to limited communication resources, it is impractical for all units to transmit sensing data such as point clouds or high-definition video. As a result, it is essential to optimize the scheduling of communication links to ensure efficient spectrum utilization for the exchange of perceptual data. In this work, we propose a deep reinforcement learning-based V2X user scheduling algorithm for collaborative perception. Given the challenges in acquiring perceptual labels, we reformulate the conventional label-dependent objective into a label-free goal, based on characteristics of 3D object detection. Incorporating both channel state information (CSI) and semantic information, we develop a double deep Q-Network (DDQN)-based user scheduling framework for collaborative perception, named SchedCP. Simulation results verify the effectiveness and robustness of SchedCP compared with traditional V2X scheduling methods. Finally, we present a case study to illustrate how our proposed algorithm adaptively modifies the scheduling decisions by taking both instantaneous CSI and perceptual semantics into account.
- North America > United States > California > Santa Cruz County > Santa Cruz (0.14)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Telecommunications (0.66)
- Transportation > Ground > Road (0.49)
Dependency-Aware CAV Task Scheduling via Diffusion-Based Reinforcement Learning
Cheng, Xiang, Mao, Zhi, Wang, Ying, Wu, Wen
In this paper, we propose a novel dependency-aware task scheduling strategy for dynamic unmanned aerial vehicle-assisted connected autonomous vehicles (CAVs). Specifically, different computation tasks of CAVs consisting of multiple dependency subtasks are judiciously assigned to nearby CAVs or the base station for promptly completing tasks. Therefore, we formulate a joint scheduling priority and subtask assignment optimization problem with the objective of minimizing the average task completion time. The problem aims at improving the long-term system performance, which is reformulated as a Markov decision process. To solve the problem, we further propose a diffusion-based reinforcement learning algorithm, named Synthetic DDQN based Subtasks Scheduling, which can make adaptive task scheduling decision in real time. A diffusion model-based synthetic experience replay is integrated into the reinforcement learning framework, which can generate sufficient synthetic data in experience replay buffer, thereby significantly accelerating convergence and improving sample efficiency. Simulation results demonstrate the effectiveness of the proposed algorithm on reducing task completion time, comparing to benchmark schemes.
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Beijing > Beijing (0.04)
Bayesian Counterfactual Prediction Models for HIV Care Retention with Incomplete Outcome and Covariate Information
Oganisian, Arman, Hogan, Joseph, Sang, Edwin, DeLong, Allison, Mosong, Ben, Fraser, Hamish, Mwangi, Ann
Like many chronic diseases, human immunodeficiency virus (HIV) is managed over time at regular clinic visits. At each visit, patient features are assessed, treatments are prescribed, and a subsequent visit is scheduled. There is a need for data-driven methods for both predicting retention and recommending scheduling decisions that optimize retention. Prediction models can be useful for estimating retention rates across a range of scheduling options. However, training such models with electronic health records (EHR) involves several complexities. First, formal causal inference methods are needed to adjust for observed confounding when estimating retention rates under counterfactual scheduling decisions. Second, competing events such as death preclude retention, while censoring events render retention missing. Third, inconsistent monitoring of features such as viral load and CD4 count lead to covariate missingness. This paper presents an all-in-one approach for both predicting HIV retention and optimizing scheduling while accounting for these complexities. We formulate and identify causal retention estimands in terms of potential return-time under a hypothetical scheduling decision. Flexible Bayesian approaches are used to model the observed return-time distribution while accounting for competing and censoring events and form posterior point and uncertainty estimates for these estimands. We address the urgent need for data-driven decision support in HIV care by applying our method to EHR from the Academic Model Providing Access to Healthcare (AMPATH) - a consortium of clinics that treat HIV in Western Kenya.
- Africa > Kenya > Western Province (0.24)
- Africa > Kenya > Trans-Nzoia County > Kitale (0.04)
- Africa > South Africa (0.04)
- (4 more...)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology > HIV (1.00)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
Bridging the Gap between ROS~2 and Classical Real-Time Scheduling for Periodic Tasks
Teper, Harun, Bell, Oren, Günzel, Mario, Gill, Chris, Chen, Jian-Jia
The Robot Operating System 2 (ROS~2) is a widely used middleware that provides software libraries and tools for developing robotic systems. In these systems, tasks are scheduled by ROS~2 executors. Since the scheduling behavior of the default ROS~2 executor is inherently different from classical real-time scheduling theory, dedicated analyses or alternative executors, requiring substantial changes to ROS~2, have been required. In 2023, the events executor, which features an events queue and allows the possibility to make scheduling decisions immediately after a job completes, was introduced into ROS~2. In this paper, we show that, with only minor modifications of the events executor, a large body of research results from classical real-time scheduling theory becomes applicable. Hence, this enables analytical bounds on the worst-case response time and the end-to-end latency, outperforming bounds for the default ROS 2 executor in many scenarios. Our solution is easy to integrate into existing ROS 2 systems since it requires only minor backend modifications of the events executor, which is natively included in ROS 2. The evaluation results show that our ROS~2 events executor with minor modifications can have significant improvement in terms of dropped jobs, worst-case response time, end-to-end latency, and performance compared to the default ROS~2 executor.
- Europe > Germany (0.04)
- North America > United States (0.04)
- Europe > Sweden (0.04)
CoRaiS: Lightweight Real-Time Scheduler for Multi-Edge Cooperative Computing
Hu, Yujiao, Jia, Qingmin, Chen, Jinchao, Yao, Yuan, Pan, Yan, Xie, Renchao, Yu, F. Richard
Multi-edge cooperative computing that combines constrained resources of multiple edges into a powerful resource pool has the potential to deliver great benefits, such as a tremendous computing power, improved response time, more diversified services. However, the mass heterogeneous resources composition and lack of scheduling strategies make the modeling and cooperating of multi-edge computing system particularly complicated. This paper first proposes a system-level state evaluation model to shield the complex hardware configurations and redefine the different service capabilities at heterogeneous edges. Secondly, an integer linear programming model is designed to cater for optimally dispatching the distributed arriving requests. Finally, a learning-based lightweight real-time scheduler, CoRaiS, is proposed. CoRaiS embeds the real-time states of multi-edge system and requests information, and combines the embeddings with a policy network to schedule the requests, so that the response time of all requests can be minimized. Evaluation results verify that CoRaiS can make a high-quality scheduling decision in real time, and can be generalized to other multi-edge computing system, regardless of system scales. Characteristic validation also demonstrates that CoRaiS successfully learns to balance loads, perceive real-time state and recognize heterogeneity while scheduling.
- North America > Canada > Ontario > National Capital Region > Ottawa (0.28)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (9 more...)
- Information Technology (1.00)
- Telecommunications (0.67)
Network Contention-Aware Cluster Scheduling with Reinforcement Learning
With continuous advances in deep learning, distributed training is becoming common in GPU clusters. Specifically, for emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can significantly degrade training throughput. However, widely used scheduling policies often face limitations as they are agnostic to network contention between jobs. In this paper, we present a new approach to mitigate network contention in GPU clusters using reinforcement learning. We formulate GPU cluster scheduling as a reinforcement learning problem and opt to learn a network contention-aware scheduling policy that efficiently captures contention sensitivities and dynamically adapts scheduling decisions through continuous evaluation and improvement. We show that compared to widely used scheduling policies, our approach reduces average job completion time by up to 18.2\% and effectively cuts the tail job completion time by up to 20.7\% while allowing a preferable trade-off between average job completion time and resource utilization.
- Asia > South Korea > Seoul > Seoul (0.05)
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Middle East > Jordan (0.04)